Predicting the pathogenicity of bacterial genomes using widely spread protein families

Background The human body is inhabited by a diverse community of commensal non-pathogenic bacteria, many of which are essential for our health. By contrast, pathogenic bacteria have the ability to invade their hosts and cause a disease. Characterizing the differences between pathogenic and commensal non-pathogenic bacteria is important for the detection of emerging pathogens and for the development of new treatments. Previous methods for classification of bacteria as pathogenic or non-pathogenic used either raw genomic reads or protein families as features. Using protein families instead of reads provided a better interpretability of the resulting model. However, the accuracy of protein-families-based classifiers can still be improved. Results We developed a wide scope pathogenicity classifier (WSPC), a new protein-content-based machine-learning classification model. We trained WSPC on a newly curated dataset of 641 bacterial genomes, where each genome belongs to a different species. A comparative analysis we conducted shows that WSPC outperforms existing models on two benchmark test sets. We observed that the most discriminative protein-family features in WSPC are widely spread among bacterial species. These features correspond to proteins that are involved in the ability of bacteria to survive and replicate during an infection, rather than proteins that are directly involved in damaging or invading the host.


S1 Previous Works
In this section, we review previous computational methods for pathogenicity classification. The methods can be divided into two main categories: Read-based methods and protein-content-based methods.
Read-based Methods. Read-based classification approaches use short genomic reads as raw input, and hence do not depend on assembled genomes or on coding sequences. Several tools were proposed for the detection of pathogens in metagenomic reads based on mapping the reads to a species, genus, or phylum of reference genomes, and/or based on their sequence composition homology [1]. However, these methods were mainly designed for taxonomic assignment using known species that are present in a reference database, rather than for pathogenicity predictions. In addition to the taxonomy-dependent methods, several taxonomy-agnostic read-based methods were designed specifically for pathogenicity predictions [2,3]. In 2017, Deneke et al. published PaPrBaG [2], an RF approach that uses classification features of two types: k-mer occurrence-based features and peptide features. PaPrBaG assigns pathogenicity prediction to each read in a genome sample, and computes the final prediction of the genome by averaging over all of its read-based prediction probabilities. A more recent read-based taxonomy-agnostic classification tool is DeePaC [3], which applies reverse-complement convolutional and recurrent neural networks to the classification task.
Protein-content-based Methods. Protein-content-based methods are classification approaches that require the availability of assembled and annotated genomes as they characterize a bacterial genome by the presence or absence of protein family members [4,5,6,7].
Iraola et al. [5] constructed an SVM-based model to classify a bacteria, based on its genome, as a human pathogen or a non-pathogen using protein families annotated as virulence genes. Thus, their method has the disadvantage of not taking into account other protein families that could be associated with either HP or NHP phenotypes. Other protein-content-based tools constructed new protein families instead of relying on existing ones [4,6,7]. Andreatta et al. [4] developed a method to predict human pathogenicity in γ-Proteobacteria. They clustered new protein families using BLAST, and identified protein families that distinguish between the two phenotypes by their enrichment in either pathogens or non-pathogens. The pathogenicity of a new bacterium was determined by the presence or absence of proteins that belong to these distinguishing protein families in its genome.
Building on the former method, Cosentino et al. [6] developed PathogenFinder, a similar method extended to handle a variety of bacterial taxonomic groups. In order to reduce the method's runtime, CD-HIT was used instead of BLAST in the protein family clustering step of the algorithm, which took four weeks for a dataset of 885 genomes.
Recently, Barash et al. [7] developed BacPaCS, a sparse-SVM-based method for bacterial pathogenicity classification. Barash et al. also used CD-HIT for the construction of protein families, but in order to reduce the runtime of the clustering phase and to scale up the method to larger training data, only the top 10% of the longest gene sequences in the training set were used for the construction of protein families. Although this selection reduced the estimated runtime from 8 months to 12 days (for 21,155 genomes), the clustering step still remained the most computationally expensive step of the training in all the aforementioned methods. Furthermore, when examining the top-scoring features of the BacPaCS classifier, as reported in the paper, it is evident that many of the top-ranking features appear in only a few genomes, which may hamper the classifier's ability to contribute to the general understanding of pathogenic lifestyle, or to predict the pathogenicity of novel bacterial species.
The number of available sequenced strains per bacterial species varies greatly, mainly due to a bias towards pathogen studies [2]. Thus, the distribution of strains per species in a given database may not represent their distribution in the real world or in future applications. Consequently, this may result in a training dataset and a classifier that are biased towards specific species. Existing read-based methods addressed this issue by selecting one strain per species for their training sets [2,3]. However, previous protein-content-based methods [4,6,7] did not properly address this normalization issue, and thus their generality may be at fault.

S2.1 Labeling Method
To identify the HP and NHP bacteria in our dataset, we initially followed Barash et al. [7] annotationbased pathogenicity classification method, as described below: 1. Genomes are labeled as HP if they satisfy any of the following criteria: (a) The 'Disease' field is not empty, and does not contain a commensal term, as defined below.
(b) One of the fields 'Isolation Source', 'Host Health', or 'Comments' includes an HP term. In addition, the same fields cannot include any of the NHP terms (the terms used for HP and NHP are presented below).
(c) A genome was manually verified as HP, by reviewing it in the literature.
2. Excluding the generated HP list, genomes are labeled as NHP if they satisfy any of the following criteria: (a) One of the fields 'Isolation Source', 'Host Health', or 'Comments' includes an NHP term.
(b) One of the fields 'Isolation Source', 'Host Health', or 'Comments' includes a weaker NHP term.

4
A manual examination of a random sample from the training set revealed that some of the genomes were mislabeled. These labeling mistakes were caused by using keywords that can both describe HP and NHP genomes. In addition, a manual examination of a random sample from the group of genomes that were labeled as inconclusive revealed that some genomes could be labeled by utilizing additional keywords and fields. Therefore, the following modifications to the annotation process were made: • 'Other Clinical' and 'Isolation Comments' were added to the list of relevant fields in 1(b) and 2(a).
• The word 'intensive' was removed from the HP terms.
• The words 'microbiome' and 'microbiota' were added to the NHP terms.
• All the weaker NHP terms were removed.
This rule-based labeling process correctly identifies HP genomes due to the pathogenic sample collection process. First, the PATRIC database utilizes information from standard medical practice in which clinical samples are taken from diseased individuals based on the illness symptoms. Thus, the decision of which sample to take (e.g., stool, urine, blood, nasal swab, sputum swab) is illnessrelated. Second, the samples are processed based on the suspected pathogens, i.e., cultured using specific conditions that will allow the growth of the suspected pathogen. Therefore, in the case of an identification of a bacterial strain, it can be medically referred to as the infectious agent.

S2.2 Manual Inspection of the WSPC Test Set Genomes
We manually inspected all the genomes in the WSPC test set (Section 2.1.3, main text) to ensure that their labels are correct by reviewing the associated PATRIC metadata. We verified a genome as HP if the isolation source is a diseased individual, and as NHP if the isolation source is a healthy individual or a probiotic supplement. Additionally, we performed a literature search to confirm the corresponding label.
The following strains could not be verified as HP or NHP: 1. The PATRIC metadata of a strain belonging to the species Escherichia marmotae (Genome ID: 1499973.23) indicated that it was collected as part of a study that sequenced metagenomes from urinary tract patients as well as from a control group. There was no indication to which of the two groups this strain belonged, and it was thus removed from the test set.
2. The PATRIC metadata of a strain belonging to an unclassified species of the genus Starkeya (Genome ID: 2666134.3) indicated that the host health status is diseased. However, this genus includes soil bacteria [8], and we were not able to find a study suggesting that Starkeya species could be pathogenic. Therefore, this genome was removed from the test set.
In addition, two strains were mislabeled by the automatic annotation. A strain of the species Bacillus clausii was labeled HP because of the keyword "disease" in the "Comments" field. The full sentence "Genome analysis of Bacillus clausii B619/R for evaluation of its health promoting and disease 5 preventing properties" indicates that it is a commensal probiotic bacterium, which is also supported by the literature [9]. Therefore, its label was corrected to NHP. A strain of the unclassified species Clostridium sp. C5-48 was labeled as HP because of the keyword "patient" in the isolation source field. However, this strain was collected from the feces of an alcoholic patient to study his metagenomic population in the colon, and a different database indicates that this strain is commensal [10]. Therefore, its label was corrected to NHP.

S2.3 Manual Verification of the BacPaCS Test Set Genomes
We manually curated the 100 genomes included in the BacPaCS test set using the metadata associated with each genome and the literature. We verified a genome label as HP if it was isolated from a diseased host (based on the PARTIC database entry), and if there was also evidence in the literature that the corresponding species or strain is pathogenic. We verified a genome label as NHP if it was isolated from a healthy host, and if the corresponding species or strain was also described in the literature as commensal or probiotic.
A list of all the genomes included in the original BacPaCS test set along with their verified labels, references to relevant studies, as well as an indication of whether each genome was included in Benchmark Test 2, is given in Table S2. In what follows, we summarize the modifications we made following the verification process.
1. A total of 18 strains belonging to the species Pseudomonas aeruginosa, Acinetobacter nosocomialis, Streptococcus sp. NPS 3089, Acinetobacter baumannii, Escherichia coli, and Enterococcus faecium were originally labeled as NHP, but were verified by us as HP as these strains were isolated from clinical samples or described in the literature as well-known pathogenic strains. This may explain the discrepancy between the pathogenicity annotations detected by Bartosezewitch et al. [3].
2. The labels of another six strains belonging to the species Fusobacterium nucleatum, Fusobacterium periodonticum and Rothia aeria, could not be verified by us as neither HP nor NHP: • The species Fusobacterium nucleatum included one strain that was collected from subgingival dental plaque (which may be an initiating factor in periodontal diseases [11]) and the species Fusobacterium periodonticum included four strains that were collected from the tongue or from dental plaque. For these genomes there was no indication in the corresponding metadata whether the hosts carried a periodontal disease. The species Fusobacterium nucleatum and Fusobacterium periodonticum are associated with a wide spectrum of human periodontal diseases [12,13]. Therefore, these strains could not be reliably labeled and were removed from the test set.
• The species Rothia aeria included one strain, which was isolated from an air sample of the living environment in the Mir space station [14]. Rothia aeria is described as an opportunistic periodontal pathogen that causes infections of immunocompromised patients and neonates, but its virulent features remain uncertain [14]. As it was collected from the 6 environment, it is not clear whether this specific strain can cause disease, and therefore it was removed from the test set.

S3.1 Evaluation Metrics
Some of the test sets used in this paper consist of more HP than NHP labeled genomes. Using a regular accuracy metric (the proportion of correct predictions in the test set) may result in misleading classification evaluation due to the imbalance between the two classes. Therefore, we used Sensitivity (true positive rate), Specificity (true negative rate), and Balanced Accuracy (BACC), which denotes the mean of sensitivity and specificity [15]. In addition, for ranking evaluation of WSPC, we used the areas under the precision recall (AUPR) [16], and the receiver operation characteristic (AUROC) [17] curves. Since AUROC considers the ranking of all predictions while accuracy only considers a single prediction threshold (i.e., 0.5) [18], we used AUROC for the feature selection parameter tuning (Section S3.2 and Section 2.4.3 in the main text). Note that in the case of a highly imbalanced dataset, AUPR is more informative than AUROC [19]. However, since there is only a slight imbalance between the classes of the validation set ( ratio of 2:1 HP to NHP), and as AUROC is more commonly used than AUPR in the general case, we opted to use AUROC for the purpose of parameter tuning. 7 S3.2 Feature Selection of the WSPC Classifier -Parameter Tuning Figure S1: Performance evaluation of the RF classifier using different values for the k and t parameters of the two-step feature selection process (Section 2.4.3, main text). We trained on the training set, and evaluated on the validation set. A. AUROC values achieved by the classifier, as a function of k -the number of features selected in the first feature selection step (based on their χ 2 scores, without correlation reduction). The maximum value was obtained for k = 450. B. AUROC values achieved by the classifier using different subsets of the 450 features selected in the first feature selection step, as a function of t -the clustering threshold. The maximum value was obtained for t = 0.18 resulting in a subset of 244 features, and is equal to the AUROC score value obtained before removing correlated features (t = 0).

S3.3 Mean Decrease Impurity Measure Computation for Feature Importance
The Mean Decrease Impurity (MDI) importance measure [20,21] of a feature of interest is computed by the Scikit-learn python package [22] as follows: 1. For each tree in the forest, the total decrease in Gini impurity in all the splits that use this feature, weighted by the proportion of samples reaching that split, is computed.
2. The resulting value in each tree is averaged over all trees in the forest Consequently, the MDI values for all features sum to one. Thus, an "important" feature is often selected for tree splits and yields a high decrease of Gini impurity, leading to a high MDI. To evaluate the feature importance of each PGFam feature in the final set of features, we computed its average MDI value using 100 RF classifiers with different random seeds (seeds 0-99) trained on the combined training and validation sets.

S5.1 A Detailed Description of Each of the Top HP PGFams
As a part of the feature selection process, we performed clustering based on a correlation measure between all pairs of features (i.e., genes), and then selected a PGFam representative from each cluster.
Interestingly, 14 out of the 15 PGFams that yielded highest MDI scores in the HP category ( Table 1 in the main text), each belong to a "singleton" cluster that contains a single member.
In what follows, we describe the biological function of each of the 15 PGFams in Table 1 from the main text. In addition, we provide a list of PGFams that belong to the same cluster as the PGFam ranked seventh (tRNA-modifying protein YgfZ).
1. Uroporphyrinogen III decarboxylase (EC 4.1.1.37) is an enzyme that catalyzes the fifth step in heme biosynthesis [23]. Heme is essential to the function of hemoproteins, which are involved in processes such as energy generation by the electron transport chain and detoxification of host immune effectors. Both heme acquisition and synthesis are important for pathogenesis [24].
Uroporphyrinogen decarboxylase was found to be important for the survival of Actinobacillus pleuropneumoniae, a pathogen swine, during infection [25].
Other PGFams in the corresponding cluster: None.
2. Dihydrolipoamide acetyltransferase (EC 2.3.1.12) is an enzyme component of the multienzyme pyruvate dehydrogenase complex, which has an important role in aerobic respiration pathways [26]. It has been shown that Dihydrolipoamide acyltransferase is critical for Mycobacterium tuberculosis pathogenesis [27], and for the colonization of Vibrio cholerae [28].
Other PGFams in the corresponding cluster: None.
3. Cytosol aminopeptidase PepA (EC 3.4.11.1). Aminopeptidases are enzymes that catalyze the cleavage of amino acids and are active in several essential cellular processes [29]. PepA transcriptionally regulates the carB gene, which playes multiple roles in the pathogenicity of Xanthomonas citri [30]. In addition, PepA mediates pH regulation of virulence genes in Vibrio cholerae [31].
Other PGFams in the corresponding cluster: None.

Protoheme IX farnesyltransferase is an enzyme involved in catalysing the conversion of heme B
to heme O, encoded by the gene ctaB. Heme O is incorporated into the electron transport chain as an electron acceptor, facilitating aerobic respiration and energy production [32]. The deletion of ctaB was observed to cause attenuated growth and virulence of Staphylococcus aureus [33], and it was also observed that this gene plays a critical role in the ability of S. aureus to secrete cytolytic toxins [34].
Other PGFams in the corresponding cluster: None.
These enzymes have been linked to virulence in a variety of bacteria [36]. The prevalence of MoCo-dependent enzymes in key bacterial pathogens, paired with the mounting evidence of their central roles in bacterial fitness during infection, suggest that they could be important future drug targets [37].
Other PGFams in the corresponding cluster: None.
6. Class 2 dihydroorotate dehydrogenase (DHODase) participates in the pyrimidine de novo biosynthesis pathway. The pyrimidine synthetic pathway plays essential roles in the pathogenesis and antibiotic resistance of P. aeruginosa and E. coli, and in the survival of the pathogen H. pylori.
Other PGFams in the corresponding cluster: None.
7. Uncharacterized tRNA-modifying protein YgfZ, participates in the synthesis and repair of ironsulfur (Fe-S) clusters. A mutation in YgfZ causes growth defects in Escherichia coli, particularly under oxidative stress, and lowers the activities several Fe-S enzymes [38].
Other PGFams in the corresponding cluster: found to provide a significant advantage for bacteria at osmotic and oxidative stress [39].
Other PGFams in the corresponding cluster: None.
9. YpfJ protein, zinc metalloprotease superfamily, is a protein that cleaves other proteins and uses zinc as a metal cofactor. Metalloproteases play multiple roles in virulence including the disruption of physiologically important host processes, release of nutrients such as metals from host metalloproteins, cleavage of host immune components, and interference with host immune signaling cascades [40].
Other PGFams in the corresponding cluster: None.

Threonine dehydratase biosynthetic (EC 4.3.1.19). Threonine dehydratase mediated isoleucine
biosynthesis is an important step in maintaining the metabolic pool of isoleucine, a branch chain amino acid. It has been shown that down-regulation of threonine dehydratase in the pathogen Mycobacterium tuberculosis increases its susceptibility to oxidative stress [41]. It also has been suggested that the genes for threonine biosynthesis are critical factors for the multiplication of Staphylococcus aureus in the blood [42]. Although S. aureus is usually commensal in the skin and the mucosa, its presence in the blood can lead to a bloodstream infection with a high fatality rate [43]. The enzymes belonging to the branch chain amino acid biosynthetic pathway in bacteria are promising drug targets due to the lack of a similar pathway in mammals, which would reduce related toxicity [44].
Other PGFams in the corresponding cluster: None.
Glutathione is an abundant antioxidant in bacteria, where it serves a key function in protecting the cell from the action of low pH, chlorine compounds, osmotic stresses, and reactive oxygen species (ROS) [45]. Glutathione reductase is one of the main enzymes involved in glutathione metabolism [45]. Recent studies suggested that generation of ROS acts as a common mechanism of antibiotics-induced bacterial death, thus inhibiting antioxidant systems such as the glutathione system may limit antibiotic resistance [46].
Other PGFams in the corresponding cluster: None.
12. Cell division integral membrane protein, YggT and half-length relatives. This is an unknown gene with the predicted function of a cell division integral membrane protein, and its gene symbol is YggT. YggT seems to play a role in osmotic stress tolerance in Escherichia coli [47]. It has been suggested that osmostress responsive systems contribute to the virulence potential of a number of pathogenic bacteria [48].
Other PGFams in the corresponding cluster: None.
To counteract this attack, some microbial pathogens express Cu, Zn superoxide dismutase enzymes [50]. This enzyme was shown to induce protection against oxidative stress and enhance the pathogenicity of Bacillus anthracis [51]. In addition, it was shown to contribute to the resistance of Mycobacterium tuberculosis against oxidative burst products generated by macrophages [52].
Other PGFams in the corresponding cluster: None.
14. Sulfur carrier protein FdhD. FdhD is required for formate dehydrogenase to be functional. Formate dehydrogenase was suggested to be involved in oxidative stress tolerance in E. coli [53].
Other PGFams in the corresponding cluster: None.
15. Deoxyribodipyrimidine photolyase. Photolyases are enzymes that repair UV-induced DNA lesions by using light energy. Interestingly, many human and plant pathogens contain photolyases [54].
Other PGFams in the corresponding cluster: None.    Table S3: A list of genomes included in the WSPC test set. 1 HP to NHP ratio among the labeled genomes of the corresponding species (before choosing a representative from each species). 2 References to literature or a database entry asserting the label and the "group" annotations the corresponding genome belongs to. 3 OPP: opportunistic.